I’m using the same dataset from prior assignments, “Differential effect of SARS-CoV-2 spike glycoprotein 1 on human bronchial and alveolar lung mucosa models: Implications on pathogenicity.”, published December 13, 2021. The corresponding publication can be found on GEO and in the references.
This experiment compares the expression response of human alveolar and bronchial cells to SARS-CoV-2 spike protein through RNA sequencing. Figure 1 from the paper shows the experimental design.
Figure 1. Experiment design. Rahman et al, 2021.
I will be using GSEA 4.1.0 from Docker image risserlin/em_base_image:version_with_bioc_3_13 per lectures and a prior homework. I will be using the current April release of a geneset from the Gary Bader lab for enrichment analysis, containing GO biological processes and pathways but no IEA, Human_GOBP_AllPathways_no_GO_iea_April_01_2022_symbol.gmt
This geneset uses HGNC symbols, so construct the ranked list for GSEA accordingly.
rnk <- output_hits[output_hits$gene != "" & !is.na(output_hits$gene),]
rnk$rank <- -log10(rnk$P.Value) * sign(rnk$logFC)
# See https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#Ranked_Gene_Lists
write.table(rnk[,c("gene", "rank")], "GSEA_GSE185657.rnk", sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)
The following commands were run in a shell due to instabilility with RStudio in the Docker image used. The parameters match that used in lectures and the prior homework.
host$ docker run --rm -it --user=rstudio -v $(pwd):/home/rstudio/projects risserlin/em_base_image:version_with_bioc_3_13 bash
rstudio@container$ cd ~/GSEA_4.1.0
rstudio@container$ ./gsea-cli.sh GSEAPreRanked -gmx ../projects/Human_GOBP_AllPathways_no_GO_iea_April_01_2022_symbol.gmt -rnk ../projects/GSEA_GSE185657.rnk -collapse false -nperm 1000 -scoring_scheme weighted -rpt_label A3 -plot_top_x 20 -rnd_seed 12345 -set_max 200 -set_min 15 -zip_report false -out ../projects/ > ../projects/gsea_output
All results have been uploaded to GitHub with the main summary here. They are transcribed below.
312 gene sets are significantly upregulated at nominal pvalue < 5%. Top results:
350 gene sets are significantly downregulated at nominal pvalue < 5%. Top results:
GSEA found fewer significantly enriched gene sets than differentially-expressed genes found via thresholded analysis at the same p-value of 0.05, however, this is not an appropriate comparison as we are comparing gene sets to genes.
GSEA identified similar upregulated and downregulated pathways. The top hits for upregulated processes and pathways are also related to interferon signaling and the viral defence response, from two of the same data sources, Reactome and GO:BP. The top hits for downregulated processes and pathways concern metabolism and translation, from two of the same data sources, Reactome and WikiPathways. Some of the exact same processes and pathways appear in both lists.
This qualititative comparison is not straightforward as I am not comparing the full lists, just the top results. Furthermore, the data sources used are different. MSigDB is only part of the source used for GSEA. The results from g:Profiler also included one query with the entire differentially-expressed gene list, which has no equivalent in GSEA.
I am using Cytoscape 3.9.1 for Windows 64-bit. Through the App Manager, I installed EnrichmentMap Pipeline Collection 1.1.0, which also installs the following apps (also cited in the references):
The following files were copied to a separate Windows folder, since I am running Docker in Windows Subsystem for Linux 2 and not mounting a Windows folder.
Human_GOBP_AllPathways_no_GO_iea_April_01_2022_symbol.gmtGSEA_GSE185657.rnk produced by this notebook and input to GSEAedb directory from GSEA outputThe following command was executed in Cytoscape’s command line, with all specified paths prefixed with the absolute path of the containing Windows directory, backslashes escaped. This command and the specified thresholds were obtained from these course notes.
enrichmentmap build analysisType="gsea" gmtFile="Human_GOBP_AllPathways_no_GO_iea_April_01_2022_symbol.gmt" pvalue=0.05 qvalue=0.25 similaritycutoff=0.375 coefficients=COMBINED ranksDataset1="edb\\GSEA_GSE185657.rnk" enrichmentsDataset1="edb\\results.edb" filterByExpressions=false expressionDataset1="GSEA_GSE185657.rnk"
To make it more explicit from that command, the thresholds used are:
Legend for all enrichment maps
Click on any of the enrichment maps to enlarge them.
According to “Tools > Analyze Network,” treating it as an undirected graph, there are 309 nodes and 1609 edges.
Using AutoAnnotate on the entire network with largely default settings:
The major themes match the top results from GSEA and thresholded analysis. They fit with the model and show established pathways. The two major connected groups of themes are one upregulated group related to telomeres, mitosis, and homologous recombination, basically cell division; and one group related to downregulation of metabolish in the electron transport chain and translation, while upregulating interferons as part of the viral response.
The enrichment results support the conclusions in the original paper and are very similar to the results from Assignment 2. The authors “observed a typical anti-viral response in the bronchial model and a pro-fibrotic response in the alveolar model.” Like g:Profiler, GSEA also identifies interferon signaling and viral defence response as upregulated and metabolism and translation as downregulated. GSEA and the enrichment maps also indicate that the mitochondrial electron transport chain and steps of mitosis are affected. The largest downregulated cluster is actually related to the electron transport chain. The largest upregulated cluster is actually related to mitosis, which supports the authors’ conclusion regarding a pro-fibrotic response in alveolar cells.
According to Reactome (2010), interferons “are cytokines that play a central role in initiating immune responses, especially antiviral.” A 2018 study by Qu et al noted that mitochondria serve as “a signaling hub for the innate immune response and facilitate downstream signaling leading to [interferon] synthesis.”
The samples in this experiment were treated with just SARS-CoV-2 spike protein. It is well-known that the spike protein of SARS-CoV-2 is used as a target for many current vaccines, especially the mRNA vaccines (CDC link). The spike protein of coronaviruses in general has been used as a target for vaccines and treatments since well before the current global COVID-19 pandemic (Du et al. 2009).
This code comes from the lecture slides.
library(GSA)
capture.output(genesets <- GSA.read.gmt("Human_GOBP_AllPathways_no_GO_iea_April_01_2022_symbol.gmt"),file="gsa_load.out")
names(genesets$genesets) <- genesets$geneset.names
#get all the GSEA directories
gsea_directories <- list.files(pattern = "\\.GseaPreranked")
gsea_dir <- gsea_directories[1]
#get the gsea result files
gsea_results_files <- list.files(path = gsea_dir, pattern = "gsea_report_*.*.tsv")
#there should be 2 gsea results files
enr_file1 <- read.table(file.path(gsea_dir,gsea_results_files[1]),
header = TRUE, sep = "\t", quote="\"",
stringsAsFactors = FALSE,row.names=1)
enr_file2 <- read.table(file.path(gsea_dir,gsea_results_files[2]),
header = TRUE, sep = "\t", quote="\"",
stringsAsFactors = FALSE,row.names=1)
#get the genes from the set of enriched pathways (no matter what threshold)
all_enr_genesets<- c(rownames(enr_file1), rownames(enr_file2))
genes_enr_gs <- c()
for(i in 1:length(all_enr_genesets)){
current_geneset <- unlist(genesets$genesets[which(genesets$geneset.names %in% all_enr_genesets[i])])
genes_enr_gs <- union(genes_enr_gs, current_geneset)
}
FDR_threshold <- 0.001
#get the genes from the set of enriched pathways (no matter what threshold)
all_sig_enr_genesets<- c(rownames(enr_file1)[
which(enr_file1[,"FDR.q.val"]<=FDR_threshold)],
rownames(enr_file2)[which(enr_file2[,"FDR.q.val"]<=FDR_threshold)])
genes_sig_enr_gs <- c()
for(i in 1:length(all_sig_enr_genesets)){
current_geneset <- unlist(genesets$genesets[which(genesets$geneset.names %in% all_sig_enr_genesets[i])])
genes_sig_enr_gs <- union(genes_sig_enr_gs, current_geneset)
}
genes_all_gs <- unique(unlist(genesets$genesets))
How many significantly enriched genes are in the gene set?
length(genes_sig_enr_gs)
## [1] 1344
if (!require(VennDiagram)) {
install.packages("VennDiagram")
}
library(VennDiagram)
draw.triple.venn(
area1=length(genes_all_gs),
area2=length(genes_enr_gs),
area3 =length(rnk$gene),
n12 = length(intersect(genes_all_gs,genes_enr_gs)),
n13=length(intersect(genes_all_gs,rnk$gene)),
n23 = length(intersect(genes_enr_gs,rnk$gene)),
n123 = length(intersect(genes_all_gs,intersect(genes_enr_gs,rnk$gene))),
category = c(
"all genesets",
"all enrichment results",
"expression"),
fill = c("red","green","blue"),
cat.col = c("red","green","blue")
)
## (polygon[GRID.polygon.1], polygon[GRID.polygon.2], polygon[GRID.polygon.3], polygon[GRID.polygon.4], polygon[GRID.polygon.5], polygon[GRID.polygon.6], text[GRID.text.7], text[GRID.text.8], text[GRID.text.9], text[GRID.text.10], text[GRID.text.11], text[GRID.text.12], text[GRID.text.13], text[GRID.text.14])
These are significantly differentially expressed but weren’t in the annotation data source used. Note that none of these top hits are missing HGNC symbols.
genes_no_annotation <- setdiff(rnk$gene, genes_all_gs)
ranked_gene_no_annotation <- rnk[which(rnk$gene %in% genes_no_annotation),]
knitr::kable(ranked_gene_no_annotation[1:10,], type="html")
| Row.names | gene | logFC | AveExpr | t | P.Value | adj.P.Val | B | rank | |
|---|---|---|---|---|---|---|---|---|---|
| 11213 | ENSG00000188707 | ZBED6CL | 1.687004 | 1.338325 | 3.665809 | 0.0011895 | 0.8315623 | -2.602529 | 2.924641 |
| 1492 | ENSG00000090372 | STRN4 | -6.626449 | 27.454873 | -3.549819 | 0.0015910 | 0.8315623 | -2.706156 | -2.798341 |
| 6511 | ENSG00000145337 | PYURF | -15.009661 | 124.198252 | -3.536024 | 0.0016467 | 0.8315623 | -2.718508 | -2.783375 |
| 12175 | ENSG00000210194 | MT-TE | -2.960040 | 6.106249 | -3.474690 | 0.0019187 | 0.8315623 | -2.773484 | -2.716985 |
| 6141 | ENSG00000141854 | MISP3 | -3.381942 | 6.542874 | -3.411846 | 0.0022427 | 0.8315623 | -2.829890 | -2.649231 |
| 14557 | ENSG00000285230 | RALY-AS1 | -3.366286 | 5.865339 | -3.389710 | 0.0023690 | 0.8315623 | -2.849771 | -2.625436 |
| 12854 | ENSG00000236432 | MFF-DT | 1.260441 | 1.795750 | 3.322266 | 0.0027979 | 0.8315623 | -2.910370 | 2.553169 |
| 12468 | ENSG00000224281 | SLC25A5-AS1 | -1.579658 | 1.870118 | -3.286434 | 0.0030554 | 0.8315623 | -2.942571 | -2.514925 |
| 9897 | ENSG00000175575 | PAAF1 | -8.386556 | 41.826509 | -3.265846 | 0.0032137 | 0.8315623 | -2.961073 | -2.493000 |
| 12496 | ENSG00000225269 | LINC00705 | 2.104523 | 2.212895 | 3.188359 | 0.0038831 | 0.8315623 | -3.030678 | 2.410816 |
top_hits <- output_hits$gene[output_hits$P.Value < 0.05]
dark_matter <- setdiff(top_hits, genes_sig_enr_gs)
dark_matter_ensembl <- output_hits$Row.names[output_hits$gene %in% dark_matter]
heatmap_matrix_tophits <- t(scale(t(normalized_counts[which(rownames(normalized_counts) %in% dark_matter_ensembl),])))
heatmap_col <- circlize::colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE, show_row_dend = TRUE,
cluster_columns = FALSE,show_column_dend = FALSE,
col=heatmap_col,show_column_names = TRUE,
show_row_names = FALSE,show_heatmap_legend = TRUE)
heatmap
top_hits <- output_hits$gene[output_hits$P.Value < 0.05]
dark_matter <- setdiff(top_hits, genes_all_gs)
dark_matter_ensembl <- output_hits$Row.names[output_hits$gene %in% dark_matter]
heatmap_matrix_tophits <- t(scale(t(normalized_counts[which(rownames(normalized_counts) %in% dark_matter_ensembl),])))
heatmap_col <- circlize::colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE, show_row_dend = TRUE,
cluster_columns = FALSE,show_column_dend = FALSE,
col=heatmap_col,show_column_names = TRUE,
show_row_names = FALSE,show_heatmap_legend = TRUE)
heatmap
Centers for Disease Control and Prevention. Understanding mRNA COVID-19 Vaccines (2022). Retrieved 2022 Apr 5 from https://www.cdc.gov/coronavirus/2019-ncov/vaccines/different-vaccines/mrna.html
Du, L., He, Y., Zhou, Y. et al. The spike protein of SARS-CoV — a target for vaccine and therapeutic development. Nat Rev Microbiol 7, 226–236 (2009). https://doi.org/10.1038/nrmicro2090
Garapati, PV. Interferon Signaling (2010). Reactome. Retrieved 2022 Apr 5 from https://reactome.org/content/detail/R-HSA-913531
GeneSetEnrichmentAnalysisWiki. Data formats. 2020. Retrieved 2022 Apr 5 from https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
Isserlin R. Enrichment Map Analysis Pipeline. 2020. Retrieved 2022 Apr 5 from https://baderlab.github.io/Cytoscape_workflows/EnrichmentMapPipeline/Protocol2_createEM.html
Merico D, Isserlin R, Stueker O, Emili A, Bader GD. Enrichment map: a network-based method for gene-set enrichment visualization and interpretation. PLoS One. 2010 Nov 15;5(11):e13984. doi: 10.1371/journal.pone.0013984. PMID: 21085593; PMCID: PMC2981572.
Morris JH, Apeltsin L, Newman AM, Baumbach J, Wittkop T, Su G, Bader GD, Ferrin TE. clusterMaker: a multi-algorithm clustering plugin for Cytoscape. BMC Bioinformatics. 2011 Nov 9;12:436. doi: 10.1186/1471-2105-12-436. PMID: 22070249; PMCID: PMC3262844.
Oesper L, Merico D, Isserlin R, Bader GD. WordCloud: a Cytoscape plugin to create a visual semantic summary of networks. Source Code Biol Med. 2011 Apr 7;6:7. doi: 10.1186/1751-0473-6-7. PMID: 21473782; PMCID: PMC3083346.
Rahman M, Irmler M, Keshavan S, Introna M et al. Differential Effect of SARS-CoV-2 Spike Glycoprotein 1 on Human Bronchial and Alveolar Lung Mucosa Models: Implications for Pathogenicity. Viruses 2021 Dec 17;13(12). PMID: 34960806
Reimand J, Isserlin R, Voisin V, Kucera M, Tannus-Lopes C, Rostamianfar A, Wadi L, Meyer M, Wong J, Xu C, Merico D, Bader GD. Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap. Nat Protoc. 2019 Feb;14(2):482-517. doi: 10.1038/s41596-018-0103-9. PMID: 30664679; PMCID: PMC6607905.
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003 Nov;13(11):2498-504. doi: 10.1101/gr.1239303. PMID: 14597658; PMCID: PMC403769.
Qu C, Zhang S, Wang W, Li M, Wang Y, van der Heijde-Mulder M, Shokrollahi E, Hakim MS, Raat NJH, Peppelenbosch MP, Pan Q. Mitochondrial electron transport chain complex III sustains hepatitis E virus replication and represents an antiviral target. FASEB J. 2019 Jan;33(1):1008-1019. doi: 10.1096/fj.201800620R. Epub 2018 Aug 2. PMID: 30070932; PMCID: PMC6355081.